DAY 26 : scrapy facebook crawl (3)

11th iThome Ironman Contest

AI & Data

Bug King Training - scrapy series, part 26

kevin8701111

Team: NUTC_IMAC_GREEN

2019-10-12 18:04:16

2349 views

Previous posts
DAY 01 : goals and plan for the contest
DAY 02 : setting up a python3 virtualenv
DAY 03 : python3 request
DAY 04 : using beautifulsoup4 and lxml
DAY 05 : grabbing tags with select and find
DAY 06 : reading values from lists after soup parsing
DAY 07 : request_header_cookie, getting past a site's age-18 restriction
DAY 08 : scraping ppt post content
DAY 09 : data processing with split, replace, strip
DAY 10 : python csv writing and dict merging
DAY 11 : python class function
DAY 12 : using the scrapy crawl framework
DAY 13 : scrapy architecture
DAY 14 : scrapy pipeline data insert mongodb
DAY 15 : scrapy middleware proxy
DAY 16 : scrapy selenium
DAY 17 : scrapy, scraping JS-rendered page data (2)
DAY 18 : scrapy splash, scraping JS-rendered page data (3)
DAY 19 : using .env in python
DAY 20 : python chartify, a data visualization package
DAY 21 : python3 pandas data processing
DAY 22 : scrapy, applying apriori to the data
DAY 23 : Datamining twitch data
DAY 24 : scrapy facebook crawl (1)
DAY 25 : scrapy facebook crawl (2)

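Today we look at the content of a Facebook post: the `parse_page` callback below handles the comments on a post page.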

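    # Excerpt from the spider class. It assumes `import scrapy`,
    # `from scrapy.loader import ItemLoader`, and the project's CommentsItem.
    def parse_page(self, response):
        '''
        parse_page does multiple things:
            1) loads replied-to comment pages one by one (for DFS)
            2) retrieves not-replied-to comments
            3) follows the link to the next page of comments
        '''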
        # Load replied-to comments one at a time: meta['index'] selects which
        # nested comment to follow on this pass.
        path = './/div[string-length(@class) = 2 and count(@id)=1 and contains("0123456789", substring(@id,1,1)) and .//div[contains(@id,"comment_replies")]]' + '[' + str(response.meta['index']) + ']'
        for reply in response.xpath(path):
            source = reply.xpath('.//h3/a/text()').extract()
            answer = reply.xpath('.//a[contains(@href,"repl")]/@href').extract()
            ans = response.urljoin(answer[-1])  # use the last matching replies link
            self.logger.info('{} nested comment @ page {}'.format(response.meta['index'], ans))
            yield scrapy.Request(ans,
                                 callback=self.parse_reply,
                                 meta={'reply_to': source,
                                       'url': response.url,
                                       'index': response.meta['index'],
                                       'flag': 'init'})
        # Load regular (not-replied-to) comments, but only once no replied-to
        # comment matches the current index.
        if not response.xpath(path):
            path2 = './/div[string-length(@class) = 2 and count(@id)=1 and contains("0123456789", substring(@id,1,1)) and not(.//div[contains(@id,"comment_replies")])]'
            for i, reply in enumerate(response.xpath(path2)):
                self.logger.info('{} regular comment @ page {}'.format(i, response.url))
                new = ItemLoader(item=CommentsItem(), selector=reply)
                new.context['lang'] = self.lang
                new.add_xpath('source', './/h3/a/text()')
                new.add_xpath('text', './/div[h3]/div[1]//text()')
                new.add_xpath('date', './/abbr/text()')
                new.add_xpath('reactions', './/a[contains(@href,"reaction/profile")]//text()')
                new.add_value('url', response.url)
                yield new.load_item()
            
        # Follow the "see_next" link to the next page of comments and restart
        # the nested-comment index at 1.
        if not response.xpath(path):
            for next_page in response.xpath('.//div[contains(@id,"see_next")]'):
                new_page = next_page.xpath('.//@href').extract()
                new_page = response.urljoin(new_page[0])
                self.logger.info('New page to be crawled {}'.format(new_page))
                yield scrapy.Request(new_page,
                                     callback=self.parse_page,
                                     meta={'index': 1})
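
The two long XPath expressions above are hard to read at a glance, so here is a minimal sketch of the same conditions written as plain Python checks. It uses only the standard library (`xml.etree.ElementTree`) rather than scrapy's selectors, and the markup in `SAMPLE`, along with the helper names `is_comment` and `has_replies`, is hypothetical and made up for illustration. Real Facebook pages are far messier. The positional `[index]` suffix that `path` appends is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup (an assumption, not real Facebook HTML).
SAMPLE = """
<div id="root">
  <div class="ab" id="101"><h3><a>Alice</a></h3><div>first</div></div>
  <div class="cd" id="102"><h3><a>Bob</a></h3><div>second</div>
    <div id="comment_replies_more_1"><a href="/comment/replies/?ctoken=1">2 replies</a></div>
  </div>
  <div class="header" id="103"><h3><a>Not a comment</a></h3></div>
  <div class="ef" id="see_next_1"><a href="/story.php?p=10">View more</a></div>
</div>
"""


def is_comment(div):
    """Mirror the part of the predicate shared by `path` and `path2`."""
    css_class = div.get("class", "")
    div_id = div.get("id")
    return (
        len(css_class) == 2                 # string-length(@class) = 2
        and div_id is not None              # count(@id) = 1
        and div_id[:1] in "0123456789"      # contains("0123456789", substring(@id,1,1))
    )


def has_replies(div):
    """Mirror .//div[contains(@id,"comment_replies")] (descendants only)."""
    return any(
        "comment_replies" in d.get("id", "")
        for d in div.iter("div")
        if d is not div
    )


root = ET.fromstring(SAMPLE)
comments = [d for d in root.iter("div") if is_comment(d)]
nested = [d.get("id") for d in comments if has_replies(d)]       # what `path` targets
regular = [d.get("id") for d in comments if not has_replies(d)]  # what `path2` targets
print("nested:", nested)
print("regular:", regular)
```

In this sample the `see_next_1` block is skipped because its id does not start with a digit, and the `header` block is skipped because its class is not exactly two characters long. The class-length and numeric-id checks are what separate comment blocks from the rest of the page in the spider above.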
